Vttablet schema tracking: Fix _vt.schema_version corruption #13045
Merged
mattlord merged 8 commits into vitessio:main on Jun 19, 2023
Description
This PR fixes a race condition which caused protobuf-marshaled schema data in _vt.schema_version rows to become corrupted when the ColumnType of query.Field pointers was modified between the time when Field message sizes were calculated and when Field message data was written to the buffer.
On Vitess versions <= 16, the race condition leads to invalid protobuf data being written to the schema_version table. When the schema historian tries to unmarshal the data it encounters an error, which breaks schema tracking on running tablets, and prevents newly started tablets from serving.
On Vitess main the race condition leads to a panic and the schema_version row is never written. This is due to the switch to using MarshalVT in #12525.
I'm proposing to fix this by copying the fields returned from schema.Engine before modifying them. As far as I can tell, nothing relied on the side effect of setting ColumnType on the shared fields.
This PR also includes a defensive change to acquire the schema.Engine mutex while marshaling the schema to protobuf. This is not strictly necessary to fix the bug, but it could help avoid future race conditions or ones that haven't been discovered yet. I would be happy to remove it if anyone feels that it's unnecessary.
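
To illustrate the copy-before-modify idea, here is a minimal, self-contained sketch. It is not the code in this PR: Field, engine, copyFields and withColumnTypes are hypothetical stand-ins for query.Field and schema.Engine, and the plain struct copy only works because the toy type holds no pointers. For a real generated protobuf message a clone helper such as proto.Clone would presumably be needed instead.

```go
package main

import (
	"fmt"
	"sync"
)

// Field is a hypothetical stand-in for query.Field; the real type is a
// generated protobuf message with many more members.
type Field struct {
	Name       string
	ColumnType string
}

// engine is a hypothetical stand-in for schema.Engine: it owns the shared
// field pointers and guards them with a mutex.
type engine struct {
	mu     sync.Mutex
	fields []*Field
}

// copyFields returns copies of the shared fields, taken under the lock,
// so callers can modify the result without touching shared state.
func (e *engine) copyFields() []*Field {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := make([]*Field, 0, len(e.fields))
	for _, f := range e.fields {
		c := *f // value copy; safe only because this toy Field has no pointers
		out = append(out, &c)
	}
	return out
}

// withColumnTypes sets ColumnType on private copies only, leaving the
// engine's shared fields as they were.
func withColumnTypes(e *engine, types map[string]string) []*Field {
	fields := e.copyFields()
	for _, f := range fields {
		if t, ok := types[f.Name]; ok {
			f.ColumnType = t
		}
	}
	return fields
}

func main() {
	e := &engine{fields: []*Field{{Name: "id"}, {Name: "name"}}}
	local := withColumnTypes(e, map[string]string{"id": "bigint", "name": "varchar(64)"})
	fmt.Println(local[0].ColumnType, local[1].ColumnType) // the copies carry the types
	fmt.Printf("%q\n", e.fields[0].ColumnType)            // the shared field is untouched
}
```

Because the marshaling side only ever sees either the untouched shared fields or a private copy, the size computed for a message cannot drift from the bytes later written for it.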
This bug was pretty disruptive for us, so I think the fix should be backported, but note that if it is backported, it may be desirable to change the call to MarshalVT back to proto.Marshal to match the existing code in released versions.
Related Issue(s)
Fixes #12981
Checklist
Deployment Notes